Running jobs using multiple GPUs#

Running NVIDIA Modulus on Sunbird#

The location of the Apptainer image can be found with module display:

[s.1915438@sl1 ~]$ module display modulus/22.07
-------------------------------------------------------------------
/apps/local/modules/tools/modulus/22.07:

module       load apptainer/1.0.3
module-whatis    add NVIDIA Modulus to PATH environment variables
setenv       MODULUS_IMG /apps/local/tools/modulus/22.07/modulus_apptainer/modulus.img
-------------------------------------------------------------------

The setenv line tells us that the environment variable MODULUS_IMG points to /apps/local/tools/modulus/22.07/modulus_apptainer/modulus.img. We can therefore start the Apptainer image while bypassing the default volume binds and environment variable exports:

apptainer shell --nv --contain --cleanenv --bind "$(pwd)":/data,/tmp:/tmp $MODULUS_IMG
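Before starting the container, it can help to confirm that MODULUS_IMG is actually set and points to a readable file. The following is a minimal sketch, assuming the module has been loaded with module load modulus/22.07; the function name check_modulus_image is hypothetical and not part of the Modulus module.

```shell
# Hypothetical helper: verify that MODULUS_IMG is set and readable
# before passing it to apptainer.
check_modulus_image() {
    if [ -z "${MODULUS_IMG:-}" ]; then
        echo "MODULUS_IMG is not set; load the module first (module load modulus/22.07)" >&2
        return 1
    fi
    if [ ! -r "$MODULUS_IMG" ]; then
        echo "Image not readable: $MODULUS_IMG" >&2
        return 1
    fi
    echo "Using image: $MODULUS_IMG"
}
```

Calling check_modulus_image on its own line before the apptainer shell command gives an early, readable error instead of a failure inside Apptainer.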

This works fine with one GPU. For multiple GPUs, we need to use mpirun (link).

The command is mpirun -np <number of GPUs>. For example, with 2 GPUs:

Apptainer> mpirun -np 2 python ldc/ldc_2d.py
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
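Rather than typing the process count by hand, it can be derived from a comma-separated list of device IDs such as 2,3. The sketch below is an illustration only; count_gpus is a hypothetical helper name, and it assumes the list contains plain device indices separated by commas.

```shell
# Hypothetical helper: count the entries in a comma-separated
# device list such as "2,3".
count_gpus() {
    local devices="${1:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Split the list on commas into positional parameters and count them.
    set -- $devices
    echo "$#"
}

# Possible usage, assuming the device list is available in the shell:
#   mpirun -np "$(count_gpus "$CUDA_VISIBLE_DEVICES")" python ldc/ldc_2d.py
```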

Problem with mpirun (SKIP IF NEEDED)#

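We need to export $CUDA_VISIBLE_DEVICES inside the Apptainer image; otherwise, if \(n\) GPUs are allocated, mpirun will use the first \(n\) GPUs on the node rather than the allocated ones. For example, when GPUs 2 and 3 were allocated, the nvidia-smi output below shows the python processes running on GPUs 0 and 1 instead: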

[s.1915438@scs2043 examples]$ echo $CUDA_VISIBLE_DEVICES
2,3
[s.1915438@scs2043 ~]$ nvidia-smi
Mon Sep 12 01:20:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   53C    P0   231W / 250W |   5487MiB / 40960MiB |     57%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:28:00.0 Off |                    0 |
| N/A   56C    P0    94W / 250W |   5487MiB / 40960MiB |     71%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   45C    P0    47W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:44:00.0 Off |                    0 |
| N/A   45C    P0    45W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCI...  On   | 00000000:A3:00.0 Off |                    0 |
| N/A   40C    P0    47W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCI...  On   | 00000000:A4:00.0 Off |                    0 |
| N/A   40C    P0    47W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCI...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   39C    P0    48W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCI...  On   | 00000000:C4:00.0 Off |                    0 |
| N/A   39C    P0    46W / 250W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2282      C   python                           5485MiB |
|    1   N/A  N/A      2283      C   python                           5485MiB |
+-----------------------------------------------------------------------------+
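The behaviour above follows from how CUDA numbers devices: each process addresses devices as cuda:0, cuda:1, and so on, and CUDA_VISIBLE_DEVICES decides which physical GPUs those indices refer to. The sketch below only models that mapping for illustration; physical_gpu is a hypothetical function and does not query any hardware.

```shell
# Hypothetical model of the CUDA device mapping:
#   physical_gpu <rank> [<device list>]
# With an empty device list, cuda:<rank> is physical GPU <rank>.
# With a list such as "2,3", cuda:0 is GPU 2 and cuda:1 is GPU 3.
physical_gpu() {
    local rank="$1" devices="${2:-}"
    if [ -z "$devices" ]; then
        echo "$rank"
        return
    fi
    local IFS=','
    # Split the list on commas into positional parameters.
    set -- $devices
    # A rank beyond the end of the list has no visible device.
    [ "$rank" -lt "$#" ] || return 1
    shift "$rank"
    echo "$1"
}
```

With an empty list, physical_gpu 0 and physical_gpu 1 give 0 and 1, which matches the processes seen on GPUs 0 and 1 above; with the list 2,3 the same ranks map to GPUs 2 and 3.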

Use mpirun#

We need to export $CUDA_VISIBLE_DEVICES into the container so that mpirun uses the allocated GPUs. With the variable exported, the processes run on GPUs 2 and 3:

[s.1915438@scs2043 ~]$ nvidia-smi -i 2,3
Mon Sep 12 01:29:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   2  NVIDIA A100-PCI...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   63C    P0   112W / 250W |   5487MiB / 40960MiB |     66%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:44:00.0 Off |                    0 |
| N/A   62C    P0   206W / 250W |   5487MiB / 40960MiB |     82%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    2   N/A  N/A      3255      C   python                           5485MiB |
|    3   N/A  N/A      3256      C   python                           5485MiB |
+-----------------------------------------------------------------------------+
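Because the container is started with --cleanenv, a missing CUDA_VISIBLE_DEVICES is easy to overlook. A small guard run inside the container before mpirun can catch this. This is a sketch under that assumption; require_visible_devices is a hypothetical name.

```shell
# Hypothetical guard: refuse to continue if CUDA_VISIBLE_DEVICES
# was not passed into the container.
require_visible_devices() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is empty; mpirun would use the first GPUs on the node" >&2
        return 1
    fi
    echo "Using GPUs: ${CUDA_VISIBLE_DEVICES}"
}

# Possible usage inside the container:
#   require_visible_devices && mpirun -np 2 python ldc_2d.py
```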

Best Practice#

  • Allocate the resources

salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2 --nodelist=scs2043
  • Switch to the compute node

srun --pty bash
  • Load NVIDIA Modulus 22.07

module load modulus/22.07
  • Start the container with --env for CUDA devices

apptainer shell --nv --contain --cleanenv --bind "$(pwd)":/data,/tmp:/tmp --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES $MODULUS_IMG

One can check that $CUDA_VISIBLE_DEVICES was imported successfully:

Apptainer> env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=2,3
  • Run an example

Apptainer> cd /data/ldc/
Apptainer> ls
conf  conf_zeroEq  ldc_2d.py  ldc_2d_importance_sampling.py  ldc_2d_zeroEq.py  openfoam
Apptainer> mpirun -np 2 python ldc_2d.py
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
[01:28:14] - attempting to restore from: outputs/ldc_2d
[01:28:14] - optimizer checkpoint not found
[01:28:14] - model flow_network.pth not found
[01:28:14] - attempting to restore from: outputs/ldc_2d
[01:28:14] - optimizer checkpoint not found
[01:28:14] - model flow_network.pth not found
[01:28:24] - [step:          0] record constraint batch time:  3.484e-01s
[01:28:40] - [step:          0] record validators time:  1.606e+01s
[01:28:52] - [step:          0] record inferencers time:  1.134e+01s
[01:28:52] - [step:          0] saved checkpoint to outputs/ldc_2d
[01:28:52] - [step:          0] loss:  5.037e-02
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:54] - Attempting cuda graph building, this may take a bit...
[01:28:54] - Attempting cuda graph building, this may take a bit...
[01:29:00] - [step:        100] loss:  7.916e-03, time/iteration:  8.412e+01 ms
[01:29:07] - [step:        200] loss:  5.416e-03, time/iteration:  6.566e+01 ms
[01:29:14] - [step:        300] loss:  4.992e-03, time/iteration:  6.685e+01 ms
[01:29:20] - [step:        400] loss:  3.430e-03, time/iteration:  6.476e+01 ms
[01:29:28] - [step:        500] loss:  2.418e-03, time/iteration:  8.192e+01 ms
[01:29:35] - [step:        600] loss:  2.150e-03, time/iteration:  6.622e+01 ms
[01:29:41] - [step:        700] loss:  1.699e-03, time/iteration:  6.515e+01 ms
  • Check the nvidia-smi output on the compute node to confirm that GPUs 2 and 3 are in use

[s.1915438@sl1 ~]$ ssh scs2043
Last login: Mon Sep 12 01:13:49 2022 from sl1
[s.1915438@scs2043 ~]$ nvidia-smi -i 2,3
Mon Sep 12 01:29:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   2  NVIDIA A100-PCI...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   63C    P0   112W / 250W |   5487MiB / 40960MiB |     66%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:44:00.0 Off |                    0 |
| N/A   62C    P0   206W / 250W |   5487MiB / 40960MiB |     82%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    2   N/A  N/A      3255      C   python                           5485MiB |
|    3   N/A  N/A      3256      C   python                           5485MiB |
+-----------------------------------------------------------------------------+
[s.1915438@scs2043 ~]$
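The interactive steps above could also be collected into a batch script. The following is only a sketch, not a tested Sunbird job script: it assumes the same account, partition, and GPU count as the salloc command, that apptainer exec accepts the same options as apptainer shell, and that --pwd can be used to start in /data/ldc. The function name run_modulus and the DRY_RUN variable are hypothetical.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai
#SBATCH --gres=gpu:2

# Hypothetical wrapper: run the LDC example on <nprocs> GPUs.
# Set DRY_RUN=1 to print the command instead of executing it.
run_modulus() {
    local nprocs="$1"
    ${DRY_RUN:+echo} apptainer exec --nv --contain --cleanenv \
        --bind "$(pwd)":/data,/tmp:/tmp \
        --pwd /data/ldc \
        --env "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}" \
        "${MODULUS_IMG:-}" \
        mpirun -np "$nprocs" python ldc_2d.py
}

# In a real job script, the function definition would be followed by:
#   module load modulus/22.07
#   run_modulus 2
```

Running with DRY_RUN=1 first prints the full apptainer command, which makes it easy to check that the device list and image path were picked up before any GPU time is spent.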